Stratification Based Hierarchical Clustering Over a Deep Web Data Source

Authors

  • Tantan Liu
  • Gagan Agrawal
Abstract

This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed on samples of the data set. The samples, in turn, can only be obtained by querying the deep web database with specific inputs. Unlike in existing sampling-based methods, the sampling cost, rather than the computation or memory cost, is the dominant consideration in designing the sampling technique. We have developed a new methodology for addressing the clustering problem on the deep web. Our work includes three new ideas: a method for stratifying a deep web data source, an algorithm for hierarchical clustering based on stratified sampling, and a two-phase sampling technique that draws a representative sample in the first phase and focuses on the boundary points between clusters in the second phase. We have evaluated our approach using two synthetic data sets and one real data set. Our experiments show that each of the three ideas leads to significant improvements in the accuracy and efficiency of clustering a hidden data source. Specifically, we improve the accuracy of the clusters obtained (measured by average distance to centers) by up to 20% over the existing approach. Viewed another way, our method achieves the same accuracy with up to 25% fewer samples, thus reducing the sampling cost.
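To make two of the terms in the abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the stratification method, the hierarchical clustering procedure, or the second-phase boundary sampling, so none of those are reproduced. The sketch assumes a hypothetical hidden source modeled as a dictionary from a query input value (one stratum per value) to the rows that query would return, and it assumes stratum sizes are known, which a real deep web source may not reveal. The function names `stratified_sample` and `average_distance_to_centers` are illustrative only.

```python
import math
import random


def stratified_sample(source, budget, seed=0):
    """Draw roughly `budget` rows, allocating draws to each stratum
    in proportion to its size (proportional allocation).

    `source` is a hypothetical stand-in for a deep web database:
    a dict mapping a query input value to the list of rows it returns.
    """
    rng = random.Random(seed)
    total = sum(len(rows) for rows in source.values())
    sample = []
    for query_value, rows in source.items():
        # At least one draw per stratum, never more than the stratum holds.
        k = min(len(rows), max(1, round(budget * len(rows) / total)))
        sample.extend(rng.sample(rows, k))
    return sample


def average_distance_to_centers(points, centers):
    """Mean Euclidean distance from each point to its nearest center.

    Lower values indicate tighter clusters; the abstract names this
    quantity as its accuracy measure.
    """
    nearest = (min(math.dist(p, c) for c in centers) for p in points)
    return sum(nearest) / len(points)
```

Under these assumptions, a sample drawn with `stratified_sample` could be clustered with any standard method and the resulting centers scored with `average_distance_to_centers`; comparing that score across sample sizes mirrors the accuracy-versus-sampling-cost trade-off the abstract reports.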

Similar resources

Comparison of Clustering Methods over a Hidden Web Data using Stratification

This paper focuses on the problem of data mining in general, and clustering in particular, over hidden web data. Data mining is a process that analyzes and extracts knowledge from large amounts of data, providing useful information to users. Hidden or deep web data resides in a database located on a remote system. So, to access such data, we need query interface or HTM...

Stratification-Based Outlier Detection over the Deep Web

For many applications, finding rare instances or outliers can be more interesting than finding common patterns. Existing work in outlier detection never considers the context of the deep web. In this paper, we argue that, for many scenarios, it is more meaningful to detect outliers over the deep web. In the context of the deep web, users must submit queries through a query interface to retrieve correspondi...

Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach

The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are structured, providing schema-rich data. Their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which is critical for enabling both source selection and query mediation, by organizing sources of with simil...

Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics

This paper discusses the future of World Wide Web development, called the Semantic Web. Undoubtedly, the Web is one of the most important services on the Internet and has had the greatest impact on the spread of the Internet in human societies. Internet penetration has been an effective factor in the growth of the volume of information on the Web. The massive growth of informat...

Analysis of Valuable Clustering Techniques for Deep Web Access and Navigation

A massive amount of content is available on the web, but a huge portion of it is still invisible. Users can only access this hidden web, also called the deep web, by entering a directed query in a web search form and thus accessing data from a database that is not indexed with hyperlinks. The inability to index particular types of content and restricted storage capacity are significant factors behind the invis...

Publication date: 2012